Sequence to Sequence Model for Video Captioning
نویسندگان
چکیده
Automatically generating video captions with natural language remains a challenge for both the field of nature language processing and computer vision. Recurrent Neural Networks (RNNs), which models sequence dynamics, has proved to be effective in visual interpretation. Based on a recent sequence to sequence model for video captioning, which is designed to learn the temporal structure of the sequence of frames and the sequence model of the generated sentences with RNNs, we investigate how pretrained language model and attentional mechanism can aid the generation of natural language descriptions of videos. We evaluate our improvements on the Microsoft Video Description Corpus (MSVD) dataset, which is a standard dataset for this task. The results demonstrate that our approach outperforms original sequence to sequence model and achieves state-of-art baselines. We further run our model one a much harder Montreal Video Annotation Dataset (M-VAD), where the model also shows promising results.
منابع مشابه
Video Captioning via Hierarchical Reinforcement Learning
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the cha...
متن کاملMulti-Task Video Captioning with Video and Entailment Generation
Video captioning, the task of describing the content of a video, has seen some promising improvements in recent years with sequence-to-sequence models, but accurately learning the temporal and logical dynamics involved in the task still remains a challenge, especially given the lack of sufficient annotated data. We improve video captioning by sharing knowledge with two related directed-generati...
متن کاملAutomatic Video Captioning via Multi-channel Sequential Encoding
In this paper, we propose a novel two-stage video captioning framework composed of 1) a multi-channel video encoder and 2) a sentence-generating language decoder. Both of the encoder and decoder are based on recurrent neural networks with long-short-term-memory cells. Our system can take videos of arbitrary lengths as input. Compared with the previous sequence-to-sequence video captioning frame...
متن کاملReinforced Video Captioning with Entailment Rewards
Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize word-level cross-entropy loss during training. First, using policy gradient and mixed-loss methods for reinforcement learning, we directly optimize sentence-level task-based metrics (as rewards), achieving significant improvements over the baseline, based on both automatic m...
متن کاملConsensus-based Sequence Training for Video Captioning
Captioning models are typically trained using the crossentropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2017